E-Commerce Analytics: Insights for Growth & Retention¶

Introduction¶

In this project, I explored online retail transactions to understand customer behavior, sales patterns, and opportunities for growth. Instead of presenting only numbers and charts, this notebook tells a story: every analysis is followed by insights and strategies that can directly guide business decisions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Dataset Information¶

The Online Retail dataset from the UCI Machine Learning Repository contains ~542k transactions from a UK-based non-store retailer between Dec 2010 and Dec 2011. The company mainly sells unique all-occasion gifts, and many customers are wholesalers.

Data Loading & Cleaning¶

First, I loaded the dataset. To keep the analysis accurate, I removed rows with missing CustomerID and invalid transactions (non-positive quantities or unit prices, which capture cancellations, returns, and data errors).

Load dataset¶

In [2]:
df = pd.read_excel("Online Retail.xlsx")

Data Cleaning¶

In [3]:
# Drop missing CustomerID and invalid transactions
df = df.dropna(subset=["CustomerID"])
df = df[(df["Quantity"]>0) & (df["UnitPrice"]>0)]

# Convert InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

# Feature: TotalPrice
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# Snapshot date (latest invoice), used later for recency calculations
snapshot_date = df["InvoiceDate"].max()

print("Cleaned dataset shape:", df.shape)
Cleaned dataset shape: (397884, 9)

After cleaning, we have ~398k transactions across 9 columns, ready for analysis.
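As an extra sanity check (not part of the original cleaning cell): cancelled orders in this dataset carry an InvoiceNo starting with "C" and negative quantities, so the Quantity > 0 filter should already have removed them. A minimal sketch on toy data:

```python
import pandas as pd

# Toy rows mimicking the raw columns (values are illustrative)
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536370"],
    "Quantity": [6, -1, 4],
})

# Same filter as the cleaning cell above
cleaned = raw[raw["Quantity"] > 0]

# Cancellations (InvoiceNo prefixed with "C") should be gone
n_cancels = cleaned["InvoiceNo"].str.startswith("C").sum()
print(n_cancels)  # 0
```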

In [4]:
df.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Top Products Analysis¶

Next, I asked: “Which products generate the most revenue?”

I aggregated total sales by product and visualized the top 10 best-sellers.

In [5]:
top_products = df.groupby("Description")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(x=top_products.values[::-1], y=top_products.index[::-1],
       orientation='h', title="Top 10 Products by Revenue").show()

We found that a small group of products, such as Paper Craft, Little Birdie and Regency Cakestand 3 Tier, contributed disproportionately to total revenue. This shows that the business relies heavily on a few key items.

Strategy: Highlight these key items in marketing campaigns and create premium bundles to increase basket value.
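The reliance on a few items can be quantified as a revenue-concentration check; the sketch below uses made-up per-product revenues, not the real figures:

```python
import pandas as pd

# Illustrative per-product revenue (hypothetical numbers)
revenue = pd.Series([500.0, 300.0, 200.0, 50.0, 30.0, 20.0],
                    index=["P1", "P2", "P3", "P4", "P5", "P6"])

# Share of total revenue contributed by the top 3 products
top_share = revenue.nlargest(3).sum() / revenue.sum()
print(round(top_share, 3))  # 0.909
```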

Country-Level Revenue Analysis¶

I then explored: “Where do most of our sales come from?”

In [6]:
top_countries = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(x=top_countries.values[::-1], y=top_countries.index[::-1],
       orientation='h', title="Top 10 Countries by Revenue").show()

The UK clearly dominated, contributing the majority of sales, while other countries played a much smaller role. This suggests that the UK is the main market driving business growth.

Strategy: Prioritize UK customers in marketing and logistics. Use smaller markets to test new promotions or localized campaigns.
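To put a number on the UK's dominance, the same groupby can feed a simple revenue-share calculation; the figures below are illustrative, not the dataset's actual totals:

```python
import pandas as pd

# Illustrative country totals (hypothetical)
country_rev = pd.Series({"United Kingdom": 7300.0, "Netherlands": 285.0,
                         "EIRE": 265.0, "Germany": 228.0})

uk_share = country_rev["United Kingdom"] / country_rev.sum()
print(round(uk_share, 3))  # 0.904
```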

Monthly Revenue Trend¶

Next question: “How does revenue change over the last 6 months?”

In [7]:
# Last 6 months
six_months_ago = snapshot_date - pd.DateOffset(months=6)
df_recent = df[df["InvoiceDate"] >= six_months_ago].copy()
df_recent["InvoiceMonth"] = df_recent["InvoiceDate"].dt.to_period("M").astype(str)

# Aggregate monthly sales
monthly_sales = df_recent.groupby("InvoiceMonth")["TotalPrice"].sum().reset_index()
monthly_sales = monthly_sales.sort_values("InvoiceMonth")

# Visualize trend
px.line(monthly_sales, x="InvoiceMonth", y="TotalPrice",
        title="Monthly Sales Trend (Last 6 Months)", markers=True).show()

Aggregating monthly sales and plotting the trend revealed fluctuations, with a clear dip in November. This could indicate seasonal variation in demand or stock shortages during that period.

Strategy: Plan inventory and promotions in advance to smooth out sales fluctuations and capture peak demand.
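A quick way to flag dips like November's is the month-over-month percentage change; a sketch on made-up monthly totals:

```python
import pandas as pd

# Illustrative monthly revenue (hypothetical values)
monthly = pd.Series([100.0, 120.0, 90.0, 130.0],
                    index=["2011-08", "2011-09", "2011-10", "2011-11"])

# Negative entries mark month-over-month dips
mom = monthly.pct_change().round(2)
print(mom.tolist())  # [nan, 0.2, -0.25, 0.44]
```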

Customer Segmentation¶

To answer “How can we group customers for tailored marketing?”, I used RFM (Recency, Frequency, Monetary) analysis.

In [8]:
# RFM calculation
rfm = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,
    "InvoiceNo": "nunique",
    "TotalPrice": "sum"
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","TotalPrice":"Monetary"})

# Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[["Recency","Frequency","Monetary"]])
In [9]:
from sklearn.cluster import KMeans

# Elbow method
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(rfm_scaled).inertia_ for k in range(1,8)]
plt.plot(range(1,8), inertia, 'o-')
plt.xlabel("k"); plt.ylabel("Inertia"); plt.title("Elbow Method")
plt.show()

The elbow plot suggests k = 3 as a reasonable choice.
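As a cross-check on the elbow plot (an addition, not in the original notebook), the silhouette score can be compared across candidate k values; a sketch on synthetic, well-separated blobs standing in for the scaled RFM matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs in 3-D (Recency, Frequency, Monetary stand-ins)
X = np.vstack([rng.normal(c, 0.3, size=(50, 3)) for c in (0.0, 3.0, 6.0)])

# Higher silhouette = better-separated clusters
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=42).fit_predict(X))
          for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3, as expected for three blobs
```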

K-Means Clustering¶

In [10]:
# Train K-Means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["Segment"] = kmeans.fit_predict(rfm_scaled)

# Segment Profile
rfm_profile = rfm.groupby("Segment").agg({"Recency":"mean","Frequency":"mean","Monetary":["mean","count"]}).round(2)
print(rfm_profile)
        Recency Frequency   Monetary      
           mean      mean       mean count
Segment                                   
0         39.98      4.85    2012.11  3231
1        245.02      1.58     631.14  1093
2          6.14     80.21  122888.41    14
In [11]:
# 3D visualization
fig_rfm3d = px.scatter_3d(rfm, x='Recency', y='Frequency', z='Monetary',
                          color='Segment', size='Monetary', opacity=0.7,
                          title="3D RFM Segments")
fig_rfm3d.show()

Using K-Means clustering, we found three clear segments:

  • VIPs: frequent, high-spending buyers
  • Mid-tier: regular, moderate buyers
  • At-risk: infrequent, low-spending buyers

Strategy:

  • Reward VIPs to retain loyalty.
  • Upsell and cross-sell to mid-tier customers.
  • Reactivate at-risk customers through targeted campaigns.
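One practical detail: K-Means assigns arbitrary cluster IDs, so the business labels above come from reading the segment profile. A sketch of that mapping, using the profile means printed earlier:

```python
import pandas as pd

# Segment profile means, echoing the table printed above
profile = pd.DataFrame({
    "Recency": [39.98, 245.02, 6.14],
    "Frequency": [4.85, 1.58, 80.21],
    "Monetary": [2012.11, 631.14, 122888.41],
}, index=[0, 1, 2])

# Rank clusters by mean spend: highest = VIP, lowest = At-risk
order = profile["Monetary"].sort_values(ascending=False).index
labels = {int(k): v for k, v in zip(order, ["VIP", "Mid-tier", "At-risk"])}
print(labels)  # {2: 'VIP', 0: 'Mid-tier', 1: 'At-risk'}
```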

High-Value Customer Prediction¶

I wanted to answer the question: “Can we predict which customers are likely to be high-value?”

To do this, I defined high-value customers as the top 25% in total spending and built a Random Forest model to predict them.

In [12]:
# Define target
q3 = df.groupby("CustomerID")["TotalPrice"].sum().quantile(0.75)
high_value_customers = df.groupby("CustomerID")["TotalPrice"].sum() > q3
In [13]:
# Features
customer_features = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,
    "InvoiceNo": "nunique",
    "UnitPrice": "mean",
    "Quantity": "mean",
    "Country": lambda x: x.mode()[0]
}).rename(columns={"InvoiceDate": "Recency", "InvoiceNo": "Frequency",
                   "UnitPrice": "AvgUnitPrice", "Quantity": "AvgQuantity"})

customer_features = pd.get_dummies(customer_features, columns=["Country"], drop_first=True)

X = customer_features
y = high_value_customers.astype(int)

# Scaling
X_scaled = scaler.fit_transform(X)

Train Test split¶

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
                                                    random_state=42, stratify=y)

Random Forest Classifier¶

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)

# Predict probabilities and classify high-value customers
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.4).astype(int)  # threshold 0.4 to capture more high-value customers

Model Evaluation¶

In [16]:
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

print("ROC-AUC:", roc_auc_score(y_test, y_proba).round(3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
ROC-AUC: 0.943
Confusion Matrix:
 [[614  37]
 [ 55 162]]
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       651
           1       0.81      0.75      0.78       217

    accuracy                           0.89       868
   macro avg       0.87      0.84      0.85       868
weighted avg       0.89      0.89      0.89       868

The Random Forest model achieved a ROC-AUC of 0.943, demonstrating strong predictive power. This means we can anticipate which customers are most likely to contribute significantly to revenue.

Strategy: Design personalized retention and upselling campaigns targeting these predicted high-value customers to maximize return on marketing investments.
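A natural follow-up (not shown in the original cells) is to ask which features drive the model, via `feature_importances_`; the sketch below uses synthetic customers where, by construction, Frequency determines the target:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500
# Synthetic customer features; the target depends only on Frequency
X = pd.DataFrame({
    "Recency": rng.integers(1, 365, n),
    "Frequency": rng.integers(1, 50, n),
    "AvgUnitPrice": rng.uniform(1, 20, n),
})
y = (X["Frequency"] > 25).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.index[0])  # Frequency
```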

Cross-Selling Opportunities¶

Finally, I asked: “Which products are often bought together?” To answer this, I applied FP-Growth to UK transactions to surface items that frequently appear in the same basket.

In [17]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Pivot table & filter UK transactions
basket_sets = (df[df['Country']=="United Kingdom"]
               .groupby(['InvoiceNo','Description'])['Quantity']
               .sum().unstack().loc[:, lambda x: x.sum() > 10] > 0)

# FP-Growth & association rules
frequent_itemsets = fpgrowth(basket_sets, min_support=0.01, use_colnames=True)
rules_top = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2) \
                .sort_values('lift', ascending=False).head(10)
rules_top
Out[17]:
antecedents consequents antecedent support consequent support support confidence lift representativity leverage conviction zhangs_metric jaccard certainty kulczynski
540 (HERB MARKER THYME) (HERB MARKER ROSEMARY) 0.010753 0.010873 0.010153 0.944134 86.829038 1.0 0.010036 17.705365 0.999228 0.884817 0.943520 0.938918
541 (HERB MARKER ROSEMARY) (HERB MARKER THYME) 0.010873 0.010753 0.010153 0.933702 86.829038 1.0 0.010036 14.921137 0.999350 0.884817 0.932981 0.938918
915 (REGENCY TEA PLATE GREEN ) (REGENCY TEA PLATE ROSES ) 0.013637 0.015980 0.011534 0.845815 52.930211 1.0 0.011316 6.382074 0.994671 0.637874 0.843311 0.783810
914 (REGENCY TEA PLATE ROSES ) (REGENCY TEA PLATE GREEN ) 0.015980 0.013637 0.011534 0.721805 52.930211 1.0 0.011316 3.545575 0.997040 0.637874 0.717958 0.783810
617 (POPPY'S PLAYHOUSE LIVINGROOM ) (POPPY'S PLAYHOUSE BEDROOM ) 0.012556 0.015619 0.010153 0.808612 51.769856 1.0 0.009956 5.143389 0.993153 0.563333 0.805576 0.729306
616 (POPPY'S PLAYHOUSE BEDROOM ) (POPPY'S PLAYHOUSE LIVINGROOM ) 0.015619 0.012556 0.010153 0.650000 51.769856 1.0 0.009956 2.821270 0.996244 0.563333 0.645550 0.729306
928 (SET OF 3 WOODEN STOCKING DECORATION) (SET OF 3 WOODEN TREE DECORATIONS) 0.014959 0.013757 0.010333 0.690763 50.211536 1.0 0.010127 3.189279 0.994968 0.562092 0.686450 0.720927
929 (SET OF 3 WOODEN TREE DECORATIONS) (SET OF 3 WOODEN STOCKING DECORATION) 0.013757 0.014959 0.010333 0.751092 50.211536 1.0 0.010127 3.957447 0.993755 0.562092 0.747312 0.720927
619 (POPPY'S PLAYHOUSE KITCHEN) (POPPY'S PLAYHOUSE LIVINGROOM ) 0.017301 0.012556 0.010693 0.618056 49.225611 1.0 0.010476 2.585309 0.996934 0.557994 0.613199 0.734865
618 (POPPY'S PLAYHOUSE LIVINGROOM ) (POPPY'S PLAYHOUSE KITCHEN) 0.012556 0.017301 0.010693 0.851675 49.225611 1.0 0.010476 6.625290 0.992142 0.557994 0.849063 0.734865

For example, customers who bought Herb Marker Rosemary often also purchased Herb Marker Thyme.

Strategy: Implement “Frequently Bought Together” recommendations and create bundled promotions to increase average order value.
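A “Frequently Bought Together” feature only needs a lookup over the rules table; the sketch below builds a tiny illustrative rules frame (mirroring the herb-marker pair found above) rather than re-running FP-Growth:

```python
import pandas as pd

# Tiny illustrative rules table, shaped like mlxtend's association_rules output
rules = pd.DataFrame({
    "antecedents": [frozenset({"HERB MARKER ROSEMARY"}),
                    frozenset({"REGENCY TEA PLATE ROSES"})],
    "consequents": [frozenset({"HERB MARKER THYME"}),
                    frozenset({"REGENCY TEA PLATE GREEN"})],
    "lift": [86.8, 52.9],
})

def recommend(product, rules, top_n=3):
    """Items implied by rules whose antecedents contain `product`, best lift first."""
    hits = rules[rules["antecedents"].apply(lambda a: product in a)]
    hits = hits.sort_values("lift", ascending=False).head(top_n)
    return [item for cons in hits["consequents"] for item in cons]

print(recommend("HERB MARKER ROSEMARY", rules))  # ['HERB MARKER THYME']
```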

Conclusion¶

Through the analysis of nearly 398,000 cleaned transactions, we uncovered actionable insights:

  • Best-selling products and the UK market are central to growth.
  • Customer segmentation enables tailored marketing strategies.
  • Predictive modeling helps proactively target high-value customers.
  • Cross-selling patterns provide opportunities for bundling and recommendations.

By turning analytics into clear strategies, this project demonstrates how data-driven decisions can boost revenue, improve customer loyalty, and drive sustainable growth in e-commerce.